Library Imports
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from datetime import date
Template
spark = (
    SparkSession.builder
    .master("local")
    .appName("Section 2.4 - Constant Values")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)
sc = spark.sparkContext
import os
data_path = "/data/pets.csv"
base_path = os.path.dirname(os.getcwd())
path = base_path + data_path
pets = spark.read.csv(path, header=True)
pets.toPandas()
| id | breed_id | nickname | birthday | age | color | |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown | 
| 1 | 2 | 3 | Argus | 2016-11-22 10:05:10 | 10 | None | 
| 2 | 3 | 1 | Chewie | 2016-11-22 10:05:10 | 15 | None | 
Constant Values
There are many instances where you will need to create a column expression or use a constant value to perform some of the spark transformations. We'll explore some of these.
Case 1: Creating a Column with a constant value (withColumn()) (wrong)
pets.withColumn('todays_date', date.today()).toPandas()
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-4-f87e239cb534> in <module>()
----> 1 pets.withColumn('todays_date', date.today()).toPandas()
/usr/local/lib/python2.7/site-packages/pyspark/sql/dataframe.pyc in withColumn(self, colName, col)
   1846 
   1847         """
-> 1848         assert isinstance(col, Column), "col should be Column"
   1849         return DataFrame(self._jdf.withColumn(colName, col._jc), self.sql_ctx)
   1850 
AssertionError: col should be Column
What Happened?
Spark functions that have a col as an argument will usually require you to pass in a Column expression. As seen in the previous section, withColumn() worked fine when we gave it a column from the current df. But this isn't the case when we want set a column to a constant value.
If you get an AssertionError: col should be Column that is usually the case, we'll look into how to fix this.
Case 1: Creating a Column with a constant value (withColumn()) (correct)
pets.withColumn('todays_date', F.lit(date.today())).toPandas()
| id | breed_id | nickname | birthday | age | color | todays_date | |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown | 2019-02-14 | 
| 1 | 2 | 3 | Argus | 2016-11-22 10:05:10 | 10 | None | 2019-02-14 | 
| 2 | 3 | 1 | Chewie | 2016-11-22 10:05:10 | 15 | None | 2019-02-14 | 
What Happened?
With F.lit() you can create a column expression that you can now assign to a new column in your dataframe.
More Examples
(
    pets
    .withColumn('age_greater_than_5', F.col("age") > 5)
    .withColumn('height', F.lit(150))
    .where(F.col('breed_id') == 1)
    .where(F.col('breed_id') == F.lit(1))
    .toPandas()
)
| id | breed_id | nickname | birthday | age | color | age_greater_than_5 | height | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown | False | 150 | 
| 1 | 3 | 1 | Chewie | 2016-11-22 10:05:10 | 15 | None | True | 150 | 
What Happened?
(We will look into equilities statements later.)
The above contains constant values (column height) and column expressions (columns using F.col()) so a F.lit() is not required.
Summary
- You need to use F.lit()to assign constant values to columns.
- Equality expressions with F.col()is also another way to have a column expressions.
- When in doubt, always use column expressions F.lit().